Guesswork, large deviations and Shannon entropy

Abstract — How hard is it to guess a password? Massey showed
that the Shannon entropy of the distribution from which the
password is selected is a lower bound on the expected number
of guesses, but one which is not tight in general. In a series
of subsequent papers under ever less restrictive stochastic
assumptions, an asymptotic relationship as password length grows
between scaled moments of the guesswork and specific Rényi
entropy was identified.

Here we show that, when appropriately scaled, as the password
length grows the logarithm of the guesswork satisfies a Large
Deviation Principle (LDP), providing direct estimates of the
guesswork distribution when passwords are long. The rate
function governing the LDP possesses a specific, restrictive form
that encapsulates underlying structure in the nature of guesswork.
Returning to Massey's original observation, a corollary to the
LDP shows that the scaled expectation of the logarithm of the
guesswork converges to the specific Shannon entropy of the
password selection process.

Index Terms — Guesswork, Rényi Entropy, Shannon Entropy,
Large Deviations

I. Introduction 

If a password, $W$, is chosen at random from a finite set
$A = \{1, \ldots, m\}$, how hard is it to guess $W$? If $\{P(W = w)\}$
is known, then an optimal strategy is to guess passwords in
decreasing order of probability. Let $G(w)$ denote the number
of attempts required to correctly guess $w \in A$,
called $w$'s guesswork. Massey [1] proved that the Shannon
entropy of $W$ is a lower bound on the expected guesswork,
$E(G(W))$, and that no general upper bound exists. This
raised serious questions about the appropriateness of Shannon
entropy as a measure of complexity of a distribution with
regard to guesswork. As a corollary to stronger results, in this
article we identify a long-password relationship between the
expectation of the logarithm of the guesswork and specific
Shannon entropy.
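As a concrete illustration of the definitions above, the following minimal Python sketch ranks a small distribution in decreasing order of probability, then computes the expected guesswork and the Shannon entropy (in nats). The function names and the four-password distribution are illustrative assumptions, not taken from the paper.

```python
import math

def guesswork_ranks(probs):
    """Map each outcome index to its guesswork: 1 for the most likely
    outcome, 2 for the next, and so on (ties broken by index)."""
    order = sorted(range(len(probs)), key=lambda i: (-probs[i], i))
    return {w: rank for rank, w in enumerate(order, start=1)}

def expected_guesswork(probs):
    """E(G(W)) under the decreasing-probability guessing strategy."""
    ranks = guesswork_ranks(probs)
    return sum(probs[w] * ranks[w] for w in ranks)

def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical four-password distribution, chosen only for illustration.
probs = [0.125, 0.5, 0.25, 0.125]
print(guesswork_ranks(probs))     # {1: 1, 2: 2, 0: 3, 3: 4}
print(expected_guesswork(probs))  # 1.875
print(shannon_entropy(probs))
```

Sorting once and reading off ranks is all the optimal strategy requires for a single distribution; the difficulty studied in the rest of the paper is that the number of candidate words grows exponentially with password length.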

Arikan [2] introduced an asymptotic regime for studying
this problem by considering a sequence of passwords, $\{W_k\}$,
with $W_k$ chosen from $A^k$ with i.i.d. letters. Again guessing
potential passwords in decreasing order of probability for
each $k$, he related the asymptotic fractional moments of the
guesswork to the Rényi entropy of a single letter,

$$\lim_{k\to\infty} \frac{1}{k} \log E\left(G(W_k)^\alpha\right) = (1+\alpha) \log \sum_{w \in A} P(W_1 = w)^{1/(1+\alpha)}$$

for $\alpha > 0$, where the right hand side is $\alpha$ times the
Rényi entropy of $W_1$ evaluated at $1/(1+\alpha)$. This result
was subsequently extended by Malone and Sullivan [3] to
word sequences with letters chosen by a Markov process and,
further still, by Pfister and Sullivan [4] to sophic shifts whose
shift space satisfies an entropy condition and whose marginals
possess a limit property. Recently, using a distinct approach,
Hanawal and Sundaresan [5] provided alternate sufficient
conditions for the existence of the limit. In all cases, the limit
is identified in terms of the specific Rényi entropy

$$\lim_{k\to\infty} \frac{1}{k} \log E\left(G(W_k)^\alpha\right) = \alpha \lim_{k\to\infty} \frac{1}{k} R_k\left(\frac{1}{1+\alpha}\right), \qquad (1)$$

where $R_k(\alpha)$ is the Rényi entropy of $W_k$

$$R_k(\alpha) := \frac{1}{1-\alpha} \log \sum_{w \in A^k} P(W_k = w)^\alpha.$$
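For i.i.d. letters the limit above can be checked numerically at small word lengths. The sketch below is a brute-force illustration under our own assumptions (the function names, the three-letter distribution and the choice of moment parameter 1 are not from the paper): it enumerates every word, sorts by probability, and compares the scaled log-moment with the single-letter expression.

```python
import itertools
import math

def scaled_log_moment(letter_probs, k, alpha):
    """(1/k) log E(G(W_k)^alpha), by brute force over all m^k words
    built from i.i.d. letters. Only feasible for small k."""
    word_probs = sorted(
        (math.prod(word) for word in itertools.product(letter_probs, repeat=k)),
        reverse=True,
    )
    moment = sum(p * n ** alpha for n, p in enumerate(word_probs, start=1))
    return math.log(moment) / k

def arikan_limit(letter_probs, alpha):
    """(1 + alpha) * log sum_w P(W_1 = w)^(1/(1+alpha)), for alpha > 0."""
    beta = 1.0 / (1.0 + alpha)
    return (1.0 + alpha) * math.log(sum(p ** beta for p in letter_probs))

letter_probs = [0.4, 0.4, 0.2]  # illustrative single-letter distribution
alpha = 1.0
for k in (2, 4, 6, 8):
    print(k, scaled_log_moment(letter_probs, k, alpha))
print("limit", arikan_limit(letter_probs, alpha))
```

Each finite-length value sits below the limiting expression, which is consistent with the non-asymptotic upper bound on guesswork moments established in [2]. The enumeration costs m^k operations, which is exactly the combinatorial burden the asymptotic results avoid.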



Here we shall assume the existence of the limit on the left
hand side of equation (1) for all $\alpha > -1$, its equality with $\alpha$
times specific Rényi entropy, its differentiability with respect
to $\alpha$ in that range and a regularity condition on the probability
of the most likely word, that $\lim_{k\to\infty} k^{-1} \log P(G(W_k) = 1)$
exists. From this, Theorem 3 deduces that the sequence
$\{k^{-1} \log G(W_k)\}$ satisfies a Large Deviation Principle (LDP)
(e.g. [6]) with a rate function $\Lambda^*$ that must possess a specific
form that will have a physical interpretation: $\Lambda^*$ is continuous
where finite, can be linear on an interval $[0, a]$, for some
$a \in [0, \log(m)]$, and then must be strictly convex while finite
on $[a, \log(m)]$.

In contrast to earlier results, Corollary 4 to the LDP gives
direct estimates on the guesswork distribution $P(G(W_k) = n)$
for large $k$, suggesting the approximation

$$P(G(W_k) = n) \approx \frac{1}{n} \exp\left(-k \Lambda^*(k^{-1} \log n)\right). \qquad (2)$$

As this calculation only involves the determination of $\Lambda^*$, to
approximately calculate the probability of the $n$th most likely
word in words of length $k$ one does not have to identify
the word itself, which would be computationally cumbersome,
particularly for non-i.i.d. word sources.

Corollary 5 to the LDP recovers a role for Shannon entropy
in the asymptotic analysis of guesswork. It shows that the
scaled expectation of the logarithm of the guesswork converges
to specific Shannon entropy

$$\lim_{k\to\infty} \frac{1}{k} E(\log G(W_k)) = \lim_{k\to\infty} \frac{1}{k} H(W_k),$$

where

$$H(W_k) := -\sum_{w \in A^k} P(W_k = w) \log P(W_k = w).$$



II. A Large Deviation Principle 



Consider the sequence of random variables $\{k^{-1} \log G(W_k)\}$.
Our starting point is the observation that the left hand side of
(1) is the scaled Cumulant Generating Function (sCGF) of this
sequence:

$$\Lambda(\alpha) := \lim_{k\to\infty} \frac{1}{k} \log E\left(e^{\alpha \log G(W_k)}\right),$$

which is shown to exist for $\alpha > 0$ in [2], [3] and for $\alpha > -1$
in [4].

Assumption 1: For $\alpha > -1$, the sCGF $\Lambda(\alpha)$ exists, is equal
to $\alpha$ times the specific Rényi entropy, and has a continuous
derivative in that range.

We also assume the following regularity condition on the
probability of the most likely word.

Assumption 2: The limit

$$g_1 = \lim_{k\to\infty} \frac{1}{k} \log P(G(W_k) = 1) \qquad (3)$$

exists in $(-\infty, 0]$.

This assumption is transparently true for words constructed of
i.i.d. or Markovian letters.

We first show that the sCGF exists everywhere.

Lemma 1 (Existence of the sCGF): Under assumptions 1
and 2, for all $\alpha < -1$

$$\Lambda(\alpha) = \lim_{k\to\infty} \frac{1}{k} \log P(G(W_k) = 1) = g_1 = \lim_{\beta \downarrow -1} \Lambda(\beta).$$

Proof: Let $\alpha < -1$ and note that

$$\log P(G(W_k) = 1) \le \log \sum_{i=1}^{m^k} P(G(W_k) = i)\, i^\alpha = \log E\left(e^{\alpha \log G(W_k)}\right) \le \log P(G(W_k) = 1) + \log \sum_{i=1}^{\infty} i^\alpha.$$

Taking $\liminf_{k\to\infty} k^{-1}$ with the first inequality and
$\limsup_{k\to\infty} k^{-1}$ with the second while using the Principle
of the Largest Term [6, Lemma 1.2.15] and usual estimates
on the harmonic series, we have that

$$\lim_{k\to\infty} \frac{1}{k} \log E\left(e^{\alpha \log G(W_k)}\right) = \lim_{k\to\infty} \frac{1}{k} \log P(G(W_k) = 1)$$

for all $\alpha < -1$.

As $\Lambda$ is the limit of a sequence of convex functions
and is finite everywhere, it is continuous and therefore
$\lim_{\beta \downarrow -1} \Lambda(\beta) = \Lambda(-1)$. ■
Thus the sCGF $\Lambda$ exists and is finite for all $\alpha$, with a potential
discontinuity in its derivative at $\alpha = -1$. This discontinuity,
when it exists, will have a bearing on the nature of the rate
function governing the LDP for $\{k^{-1} \log G(W_k)\}$. Indeed, the
following quantity will play a significant role in our results:

$$\gamma := \lim_{\alpha \downarrow -1} \frac{d}{d\alpha} \Lambda(\alpha). \qquad (4)$$

We will prove that the number of words with approximately
equal highest probability is close to $\exp(k\gamma)$. In the special
case where the $\{W_k\}$ are constructed of i.i.d. letters, this is
exactly true and the veracity of the following Lemma can be
verified directly.



Lemma 2 (The number of most likely words): If $\{W_k\}$ are
constructed of i.i.d. letters, then

$$\gamma = \lim_{\alpha \downarrow -1} \frac{d}{d\alpha} \alpha R_1\left((1+\alpha)^{-1}\right) = \log \left|\{w : P(W_1 = w) = P(G(W_1) = 1)\}\right|,$$

where $|\cdot|$ indicates the number of elements in the set.
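Lemma 2 can be checked numerically for i.i.d. letters. The sketch below is an illustration under our own assumptions: it writes the i.i.d. sCGF, (1 + a) log of the sum of letter probabilities raised to 1/(1 + a), in an algebraically equivalent log-sum-exp form so that it stays finite near a = -1, and it estimates the one-sided derivative by a finite difference. The function names and the step size `h` are hypothetical choices.

```python
import math

def scgf_iid(letter_probs, alpha):
    """sCGF for i.i.d. letters. For alpha > -1 this equals
    (1 + alpha) * log(sum_w p_w^(1/(1+alpha))), rewritten by factoring
    out the largest letter probability; for alpha <= -1 it is constant."""
    log_p = [math.log(p) for p in letter_probs if p > 0]
    log_pmax = max(log_p)
    if alpha <= -1.0:
        return log_pmax
    beta = 1.0 / (1.0 + alpha)
    s = sum(math.exp(beta * (lp - log_pmax)) for lp in log_p)
    return log_pmax + (1.0 + alpha) * math.log(s)

def gamma_estimate(letter_probs, h=1e-6):
    """One-sided finite-difference estimate of the slope just above -1."""
    return (scgf_iid(letter_probs, -1.0 + h) - scgf_iid(letter_probs, -1.0)) / h

def log_count_most_likely(letter_probs):
    """Log of the number of letters sharing the highest probability."""
    p_max = max(letter_probs)
    return math.log(sum(1 for p in letter_probs if p == p_max))

for probs in ([0.4, 0.4, 0.2], [0.5, 0.3, 0.2], [0.25] * 4):
    print(probs, gamma_estimate(probs), log_count_most_likely(probs))
```

The two printed columns should agree closely: two equally likely top letters give a slope near log 2, a unique top letter gives a slope near 0, and a uniform alphabet of four letters gives log 4.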
This i.i.d. result doesn't extend directly to the non-i.i.d. case
and in general Lemma 2 can only be used to establish a lower
bound on $\gamma$:

$$\gamma = \lim_{\alpha \downarrow -1} \frac{d}{d\alpha} \Lambda(\alpha) \ge \limsup_{k\to\infty} \lim_{\alpha \downarrow -1} \frac{1}{k} \frac{d}{d\alpha} \alpha R_k\left((1+\alpha)^{-1}\right), \qquad (5)$$

e.g. [7, Theorem 24.5]. This lower bound can be loose, as can
be seen with the following example. Consider the sequence of
distributions for some $\epsilon > 0$

$$P(W_k = i) = \begin{cases} m^{-k}(1+\epsilon) & \text{if } i = 1 \\ m^{-k}\left(1 - \epsilon (m^k - 1)^{-1}\right) & \text{otherwise.} \end{cases}$$

For each fixed $k$ there is one most likely word and we have
$\log(1) = 0$ on the right hand side of equation (5) by Lemma
2. The left hand side, however, gives $\log(m)$. Regardless,
this intuition guides our understanding of $\gamma$, but the formal
statement of it approximately capturing the number of most
likely words will transpire to be

$$g_1 = \lim_{k\to\infty} \frac{1}{k} \log \inf_{\{w : G(w) < \exp(k\gamma)\}} P(W_k = w),$$

where $g_1$ is defined in equation (3).

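We define the candidate rate function as the Legendre-Fenchel
transform of the sCGF

$$\Lambda^*(x) := \sup_{\alpha \in \mathbb{R}} \{x\alpha - \Lambda(\alpha)\} = \begin{cases} -x - g_1 & \text{if } x \in [0, \gamma] \\ \sup_{\alpha \in \mathbb{R}} \{x\alpha - \Lambda(\alpha)\} & \text{if } x \in (\gamma, \log(m)]. \end{cases}$$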
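The Legendre-Fenchel transform above is straightforward to evaluate numerically. The sketch below is one possible approach under stated assumptions, not the authors' procedure: because the sCGF is convex, the objective is concave in the transform variable and a ternary search suffices, and because the sCGF is constant at and below -1 (Lemma 1) the search can be restricted to values of at least -1 whenever x is nonnegative. The helper name `rate_function`, the cap `alpha_max` and the uniform-letter test case are our own choices. For uniform i.i.d. letters on m symbols the sCGF is the moment parameter times log m above -1, so the transform has the closed form log(m) - x on [0, log m], which serves as a check.

```python
import math

def rate_function(scgf, x, alpha_max=200.0, iterations=200):
    """Approximate the supremum over a of x*a - scgf(a) by ternary
    search on [-1, alpha_max]. Truncating at alpha_max gives a lower
    bound on the supremum when the maximiser lies beyond the cap."""
    def objective(a):
        return x * a - scgf(a)
    lo, hi = -1.0, alpha_max
    for _ in range(iterations):
        third = (hi - lo) / 3.0
        if objective(lo + third) < objective(hi - third):
            lo += third   # maximiser is to the right of the left probe
        else:
            hi -= third   # maximiser is to the left of the right probe
    return objective(0.5 * (lo + hi))

# Uniform i.i.d. letters on m symbols (illustrative test case).
m = 3
uniform_scgf = lambda a: max(a, -1.0) * math.log(m)

for x in (0.0, 0.25, 0.5, 1.0):
    print(x, rate_function(uniform_scgf, x), math.log(m) - x)
```

Any convex sCGF can be passed in place of `uniform_scgf`; for non-uniform sources the same routine traces both the linear segment and the strictly convex part of the rate function.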



The LDP cannot be proved directly by Baldi's version of the
Gärtner-Ellis theorem [8], [6, Theorem 4.5.20] as $\Lambda^*$ does not
have exposing hyper-planes for $x \in [0, \gamma]$. Instead we use a
combination of that theorem with the methodology described
in detail in [9] where, as our random variables are bounded,
$0 \le k^{-1} \log G(W_k) \le \log(m)$, in order to prove the LDP
it suffices to show that the following limits exist in $[-\infty, 0]$ for all
$x \in [0, \log m]$ and equal $-\Lambda^*(x)$:

$$\lim_{\epsilon \downarrow 0} \liminf_{k\to\infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in B_\epsilon(x)\right) = \lim_{\epsilon \downarrow 0} \limsup_{k\to\infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in B_\epsilon(x)\right), \qquad (6)$$

where $B_\epsilon(x) = (x - \epsilon, x + \epsilon)$.

Theorem 3 (The large deviations of guesswork): Under
assumptions 1 and 2, the sequence $\{k^{-1} \log G(W_k)\}$ satisfies a
LDP with rate function $\Lambda^*$.

Proof: To establish (6) we have separate arguments
depending on $x$. We divide $[0, \log(m)]$ into two parts:
$[0, \gamma]$ and $(\gamma, \log(m)]$. Baldi's upper bound holds for any
$x \in [0, \log(m)]$. Baldi's lower bound applies for any $x \in
(\gamma, \log(m)]$ as $\Lambda^*$ is continuous and, as $\Lambda(\alpha)$ has a continuous
derivative for $\alpha > -1$, it only has a finite number of points
without exposing hyper-planes in that region. For $x \in [0, \gamma]$,
however, we need an alternate lower bound.
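Consider $x \in [0, \gamma]$ and define the sets

$$K_k(x, \epsilon) := \left\{w \in A^k : k^{-1} \log G(w) \in B_\epsilon(x)\right\},$$

letting $|K_k(x, \epsilon)|$ denote the number of elements in each set.
We have the bound

$$|K_k(x, \epsilon)| \inf_{w \in K_k(x, \epsilon)} P(W_k = w) \le P\left(\frac{1}{k} \log G(W_k) \in B_\epsilon(x)\right). \qquad (7)$$

As $\lfloor e^{k(x-\epsilon)} \rfloor \le |K_k(x, \epsilon)| \le \lceil e^{k(x+\epsilon)} \rceil$, we have that

$$x = \lim_{\epsilon \downarrow 0} \lim_{k\to\infty} \frac{1}{k} \log |K_k(x, \epsilon)|.$$

By Baldi's upper bound, we have that

$$\lim_{\epsilon \downarrow 0} \limsup_{k\to\infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in B_\epsilon(x)\right) \le x + g_1.$$

Thus to complete the argument, for the complementary lower
bound we need to show that for any $x \in [0, \gamma]$

$$\lim_{\epsilon \downarrow 0} \liminf_{k\to\infty} \inf_{w \in K_k(x, \epsilon)} \frac{1}{k} \log P(W_k = w) = g_1.$$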

If $\Lambda^*(x) < \infty$ for some $x > \gamma$, then for $\epsilon > 0$ sufficiently
small let $x^*$ be such that $\Lambda^*(x^*) < \infty$ and $x^* - \epsilon >
\max(\gamma, x + \epsilon)$. Then by Baldi's lower bound, which applies
as $x^* \in (\gamma, \log(m)]$, we have

$$-\inf_{y \in B_\epsilon(x^*)} \Lambda^*(y) \le \liminf_{k\to\infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in B_\epsilon(x^*)\right).$$

Now

$$P\left(\frac{1}{k} \log G(W_k) \in B_\epsilon(x^*)\right) \le |K_k(x^*, \epsilon)| \sup_{w \in K_k(x^*, \epsilon)} P(W_k = w) \le |K_k(x^*, \epsilon)| \inf_{w \in K_k(x, \epsilon)} P(W_k = w),$$

where in the last line we have used the monotonicity of
guesswork and the fact that $x^* - \epsilon > x + \epsilon$. Taking lower
limits and using the bounds above on $|K_k(x^*, \epsilon)|$, we have that

$$-\inf_{y \in B_\epsilon(x^*)} \Lambda^*(y) \le x^* + \liminf_{k\to\infty} \inf_{w \in K_k(x, \epsilon)} \frac{1}{k} \log P(W_k = w)$$

for all such $x^*$, $x$. Taking limits as $\epsilon \downarrow 0$ and then limits as
$x^* \downarrow \gamma$ we have

$$-\lim_{x^* \downarrow \gamma} \Lambda^*(x^*) \le \gamma + \lim_{\epsilon \downarrow 0} \liminf_{k\to\infty} \inf_{w \in K_k(x, \epsilon)} \frac{1}{k} \log P(W_k = w),$$

but $\lim_{x^* \downarrow \gamma} \Lambda^*(x^*) = -\gamma - g_1$ so that

$$\lim_{\epsilon \downarrow 0} \liminf_{k\to\infty} \inf_{w \in K_k(x, \epsilon)} \frac{1}{k} \log P(W_k = w) = g_1,$$

as required.

Only one case remains: if $\Lambda^*(x) = \infty$ for all $x > \gamma$, then
we require an alternative argument to ensure that

$$\liminf_{k\to\infty} \inf_{w \in K_k(x, \epsilon)} \frac{1}{k} \log P(W_k = w) = g_1.$$

This situation happens if, in the limit, the distribution of
words is near uniform on the set of all words with positive
probability. Thus define

$$\mu := \limsup_{k\to\infty} \frac{1}{k} \log \left|\{w : P(W_k = w) > 0\}\right|.$$

As $\Lambda^*(x) = \infty$ for all $x > \gamma$, $\mu \le \gamma$. To see $\gamma = \mu$, note that
$\gamma = \lim_{\alpha \downarrow -1} \Lambda'(\alpha) \le \Lambda'(0)$. As both $\Lambda(\alpha)$ and
$\alpha R_k((1+\alpha)^{-1})$ are finite and differentiable in a neighborhood of 0, by
[7, Theorem 25.7]

$$\Lambda'(0) = \lim_{k\to\infty} \frac{1}{k} \frac{d}{d\alpha} \alpha R_k\left((1+\alpha)^{-1}\right)\Big|_{\alpha=0} = \lim_{k\to\infty} \frac{1}{k} H(W_k)$$

and $\lim_{k\to\infty} k^{-1} H(W_k) \le \mu$. Thus $\gamma = \mu$ and, due to
convexity, $\Lambda$ is linear with slope $\mu$ on $\alpha \in (-1, 0]$. As
$\Lambda(0) = 0$, using Lemma 1 we have that $g_1 = -\mu$. Let $x < \mu$
and consider

$$l = \limsup_{k\to\infty} \sup_{w \in K_k(x+2\epsilon, \epsilon)} \frac{1}{k} \log P(W_k = w) \le \liminf_{k\to\infty} \inf_{w \in K_k(x, \epsilon)} \frac{1}{k} \log P(W_k = w).$$

We shall assume that $l < g_1$ and show this results in a
contradiction. Let $\epsilon < \min(g_1 - l, \mu - x)/2$, then there exists
$N_\epsilon$ such that

$$\sum_{w \in A^k} P(W_k = w) \le e^{k(x+\epsilon)} e^{k(g_1+\epsilon)} + e^{k(\mu+\epsilon)} e^{k(l+\epsilon)} = e^{k(-\mu+x+2\epsilon)} + e^{k(-g_1+l+2\epsilon)}$$

for all $k > N_\epsilon$, but this is strictly less than 1 for $k$ sufficiently
large and thus $l = g_1$. Finally, for $x = \mu$, and $\epsilon > 0$, note that
we can decompose $[0, \log(m)]$ into three parts, $[0, \mu - \epsilon] \cup (\mu -
\epsilon, \mu + \epsilon) \cup [\mu + \epsilon, \log(m)]$, where the scaled probability of the
guesswork being in either the first or last set is decaying, but

$$0 = \lim_{k\to\infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in [0, \log(m)]\right)$$

and so the result follows from an application of the principle
of the largest term.

Thus for any $x \in [0, \log(m)]$,

$$\lim_{\epsilon \downarrow 0} \liminf_{k\to\infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in B_\epsilon(x)\right) = \lim_{\epsilon \downarrow 0} \limsup_{k\to\infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in B_\epsilon(x)\right) = -\Lambda^*(x)$$

and the LDP is proved. ■
In establishing the LDP, we have shown that any rate
function that governs such an LDP must have the form of
a straight line in $[0, \gamma]$ followed by a strictly convex function.
The initial straight line comes from all words that are, in an
asymptotic sense, of greatest likelihood.

While the LDP is for the sequence $\{k^{-1} \log G(W_k)\}$, it can
be used to develop the more valuable direct estimate of the
distribution of each $G(W_k)$ found in equation (2). The next
corollary provides a rigorous statement, but an intuitive,
non-rigorous argument for understanding the result therein is that
from the LDP we have the approximation that for large $k$

$$dP\left(\frac{1}{k} \log G(W_k) = x\right) \approx \exp(-k \Lambda^*(x)).$$

As for large $k$ the distributions of $k^{-1} \log G(W_k)$ and
$G(W_k)/k$ are ever closer to having densities, using the change
of variables formula gives

$$dP\left(\frac{1}{k} G(W_k) = x\right) = \frac{1}{kx}\, dP\left(\frac{1}{k} \log G(W_k) = \frac{1}{k} \log(kx)\right) \approx \frac{1}{kx} \exp\left(-k \Lambda^*\left(\frac{1}{k} \log(kx)\right)\right).$$

Finally, the substitution $kx = n$ gives the approximation in
equation (2). To make this heuristic precise requires distinct
means, explained in the following corollary.

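Corollary 4 (Direct estimates on guesswork): Recall the
definition

$$K_k(x, \epsilon) := \left\{w \in A^k : k^{-1} \log G(w) \in B_\epsilon(x)\right\}.$$

For any $x \in [0, \log(m)]$ we have

$$\lim_{\epsilon \downarrow 0} \liminf_{k\to\infty} \frac{1}{k} \log \inf_{w \in K_k(x, \epsilon)} P(W_k = w) = \lim_{\epsilon \downarrow 0} \limsup_{k\to\infty} \frac{1}{k} \log \sup_{w \in K_k(x, \epsilon)} P(W_k = w) = -(x + \Lambda^*(x)).$$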

Proof: We show how to prove the upper bound as the
lower bound follows using analogous arguments, as do the
edge cases. Let $x \in (0, \log(m))$ and $\epsilon > 0$ be given. Using
the monotonicity of guesswork

$$\limsup_{k\to\infty} \frac{1}{k} \log \sup_{w \in K_k(x, \epsilon)} P(W_k = w) \le \liminf_{k\to\infty} \frac{1}{k} \log \inf_{w \in K_k(x-2\epsilon, \epsilon)} P(W_k = w).$$

Using the estimate found in Theorem 3 and the LDP provides
an upper bound on the latter:

$$(x - 3\epsilon) + \liminf_{k\to\infty} \frac{1}{k} \log \inf_{w \in K_k(x-2\epsilon, \epsilon)} P(W_k = w) \le \liminf_{k\to\infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in B_\epsilon(x - 2\epsilon)\right)$$

$$\le \limsup_{k\to\infty} \frac{1}{k} \log P\left(\frac{1}{k} \log G(W_k) \in [x - 3\epsilon, x - \epsilon]\right) \le -\inf_{y \in [x-3\epsilon, x-\epsilon]} \Lambda^*(y).$$

Thus

$$\limsup_{k\to\infty} \frac{1}{k} \log \sup_{w \in K_k(x, \epsilon)} P(W_k = w) \le -x + 3\epsilon - \inf_{y \in [x-3\epsilon, x-\epsilon]} \Lambda^*(y).$$

Thus the upper bound follows taking $\epsilon \downarrow 0$ and using the
continuity when finite of $\Lambda^*$. ■

Fig. 1. Illustration of Corollary 4. Words constructed from i.i.d. letters with
$P(W_1 = 1) = 0.4$, $P(W_1 = 2) = 0.4$, $P(W_1 = 3) = 0.2$. For $k = 15$,
comparison of the probability of the $n$th most likely word and the approximation
$(1/n) \exp(-k \Lambda^*(k^{-1} \log n))$ versus $n \in \{1, \ldots, 3^{15}\}$. Horizontal axis: $G(w)$.

Unpeeling limits, this corollary shows that when $k$ is large
the probability of the $n$th most likely word is approximately
$(1/n) \exp(-k \Lambda^*(k^{-1} \log n))$, without the need to identify the
word itself. This justifies the approximation in equation (2),
whose complexity of evaluation does not depend on $k$. We
demonstrate its merit by example in Section III.

Before that, as a corollary to the LDP we find the following
role for the specific Shannon entropy. Thus, although Massey
established that for a given word length the Shannon entropy
is only a lower bound on the guesswork, for growing password
length the specific Shannon entropy determines the linear
growth rate of the expectation of the logarithm of guesswork.

Corollary 5 (Shannon entropy and guesswork): Under
assumptions 1 and 2,

$$\lim_{k\to\infty} \frac{1}{k} E(\log G(W_k)) = \lim_{k\to\infty} \frac{1}{k} H(W_k),$$

the specific Shannon entropy.

Proof: Note that $\Lambda^*(x) = 0$ if and only if $x = \Lambda'(0) =
\lim_{k\to\infty} k^{-1} H(W_k)$, by arguments found in the proof of Theorem
3. The weak law then follows by concentration of measure,
e.g. ED- ■ 
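Corollary 5 can be explored at small word lengths for i.i.d. letters, where the specific Shannon entropy equals the entropy of a single letter. The sketch below is a brute-force illustration under our own assumptions (function names and the three-letter distribution are hypothetical): it computes the scaled expected log-guesswork for a few word lengths and prints the single-letter entropy for comparison.

```python
import itertools
import math

def scaled_expected_log_guesswork(letter_probs, k):
    """(1/k) E(log G(W_k)) by brute force over all m^k words built
    from i.i.d. letters. Only feasible for small k."""
    word_probs = sorted(
        (math.prod(word) for word in itertools.product(letter_probs, repeat=k)),
        reverse=True,
    )
    total = sum(p * math.log(n) for n, p in enumerate(word_probs, start=1))
    return total / k

def letter_entropy(letter_probs):
    """Shannon entropy of one letter in nats; for i.i.d. letters this
    equals (1/k) H(W_k) for every k."""
    return -sum(p * math.log(p) for p in letter_probs if p > 0)

letter_probs = [0.4, 0.4, 0.2]  # illustrative single-letter distribution
for k in (2, 4, 6, 8):
    print(k, scaled_expected_log_guesswork(letter_probs, k))
print("specific Shannon entropy", letter_entropy(letter_probs))
```

Every finite-length value lies at or below the entropy, because the $n$th most likely word has probability at most $1/n$, so the log-guesswork of a word never exceeds minus the log of its probability. The corollary states that the gap vanishes as the word length grows; the word lengths reachable by enumeration are too short to show the limit itself.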

III. Examples 

I.i.d. letters.

Assume words are constructed of i.i.d. letters. Let $W_1$ take
values in $A = \{1, \ldots, m\}$ and assume $P(W_1 = i) \ge P(W_1 =
j)$ if $i \le j$. Then from [2], [4] and Lemma 1 we have that

$$\Lambda(\alpha) = \begin{cases} (1+\alpha) \log \sum_{w \in A} P(W_1 = w)^{1/(1+\alpha)} & \text{if } \alpha > -1 \\ \log P(W_1 = 1) & \text{if } \alpha \le -1. \end{cases}$$

From Lemma 2 we have that

$$\gamma = \lim_{\alpha \downarrow -1} \Lambda'(\alpha) \in \{0, \log(2), \ldots, \log(m)\}$$

and no other values are possible.

Fig. 2. Illustration of Corollary 4. Words constructed from i.i.d. letters with
$P(W_1 = 1) = 0.4$, $P(W_1 = 2) = 0.4$, $P(W_1 = 3) = 0.2$. For $k =
10$, 20 and 100, comparison of $k^{-1}$ times the logarithm of the probability of
the $n$th most likely word versus $k^{-1}$ times the logarithm of $n$, as well as the
approximation $-x - \Lambda^*(x)$ versus $x$. Horizontal axis: $k^{-1} \log G(w)$.

Fig. 3. Illustration of rate functions in Theorem 3. Words constructed from
Markov letters on $|A| = 2$. Three rate functions illustrating the only values of $\gamma$
possible, $\log(1)$, $\log(\phi) \approx 0.48$ and $\log(2)$, from Lemma 6. Horizontal axis: $x$.

Unless the distribution of $W_1$ is uniform, $\Lambda^*(x)$ does not have
a closed form for all $x$, but is readily calculated numerically.
With $|A| = 3$ and $k = 15$, Figure 1 compares the exact
distribution $P(W_k = w)$ versus $G(w)$ with the approximation
found in equation (2). As there are $3^{15} \approx 1.4 \times 10^{7}$ words, the
likelihood of any one word is tiny, but the quality of the
approximation can clearly be seen. Rescaling the guesswork and
probabilities to make them comparable for distinct $k$, Figure 2
illustrates the quality of the approximation as $k$ grows. By
$k = 100$ there are $3^{100} \approx 5.1 \times 10^{47}$ words and the underlying
combinatorial complexities of the explicit calculation become
immense, yet the complexity of calculating the approximation
has not increased.
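The comparison described above can be reproduced at a small scale. The sketch below is a minimal illustration under our own assumptions, not the code used for the figures: it evaluates the i.i.d. sCGF in a numerically stable form, obtains the rate function by ternary search over the concave objective (restricted to moment parameters of at least -1, which loses nothing for nonnegative x by Lemma 1), and sets the approximation $(1/n) \exp(-k \Lambda^*(k^{-1} \log n))$ beside the exact probability of the $n$th most likely word. The function names, the search cap `alpha_max` and the word length 8 are hypothetical choices.

```python
import itertools
import math

def scgf_iid(letter_probs, alpha):
    """sCGF for i.i.d. letters; constant (log of the largest letter
    probability) for alpha <= -1."""
    log_p = [math.log(p) for p in letter_probs if p > 0]
    log_pmax = max(log_p)
    if alpha <= -1.0:
        return log_pmax
    beta = 1.0 / (1.0 + alpha)
    s = sum(math.exp(beta * (lp - log_pmax)) for lp in log_p)
    return log_pmax + (1.0 + alpha) * math.log(s)

def rate_function_iid(letter_probs, x, alpha_max=200.0, iterations=200):
    """Ternary search for the supremum over a in [-1, alpha_max] of
    x*a - scgf(a); the objective is concave in a."""
    def objective(a):
        return x * a - scgf_iid(letter_probs, a)
    lo, hi = -1.0, alpha_max
    for _ in range(iterations):
        third = (hi - lo) / 3.0
        if objective(lo + third) < objective(hi - third):
            lo += third
        else:
            hi -= third
    return objective(0.5 * (lo + hi))

def approx_nth_probability(letter_probs, k, n):
    """(1/n) * exp(-k * rate(log(n)/k)), the approximation in equation (2)."""
    x = math.log(n) / k
    return math.exp(-k * rate_function_iid(letter_probs, x)) / n

def exact_sorted_probabilities(letter_probs, k):
    """Exact word probabilities in decreasing order (m^k entries)."""
    return sorted(
        (math.prod(word) for word in itertools.product(letter_probs, repeat=k)),
        reverse=True,
    )

letter_probs = [0.4, 0.4, 0.2]
k = 8
exact = exact_sorted_probabilities(letter_probs, k)
for n in (1, 10, 100, 1000, len(exact)):
    print(n, exact[n - 1], approx_nth_probability(letter_probs, k, n))
```

At word lengths this short, agreement should only be expected on the exponential scale, since the LDP does not capture sub-exponential factors. The approximation's cost is independent of the word length, whereas the exact column requires sorting all m^k words.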

Markovian letters.

As an example of words constructed of correlated letters,
consider $\{W_k\}$ where the letters are chosen via
a Markov chain with transition matrix $P$ and some initial
distribution on $|A| = 2$. Define the matrix $P_\alpha$ by $(P_\alpha)_{i,j} =
p_{i,j}^{1/(1+\alpha)}$, then by [3], [4] and Lemma 1 we have that

$$\Lambda(\alpha) = \begin{cases} (1+\alpha) \log \rho(P_\alpha) & \text{if } \alpha > -1 \\ \log \max\left(p_{1,1}, p_{2,2}, \sqrt{p_{1,2}\, p_{2,1}}\right) & \text{if } \alpha \le -1, \end{cases}$$

where $\rho$ is the spectral radius operator. In the two letter
alphabet case, with $\beta = 1/(1+\alpha)$ we have that $\rho(P_{(1-\beta)/\beta})$
equals

$$\frac{p_{1,1}^\beta + p_{2,2}^\beta}{2} + \frac{\sqrt{\left(p_{1,1}^\beta - p_{2,2}^\beta\right)^2 + 4 (1 - p_{1,1})^\beta (1 - p_{2,2})^\beta}}{2}.$$

As with the i.i.d. letters example, apart from in special cases,
the rate function $\Lambda^*$ cannot be calculated in closed form,
but is readily evaluated numerically. Regardless, we have the
following, perhaps surprising, result on the exponential rate of
growth of the size of the set of almost most likely words.

Lemma 6 (The Golden Ratio and Markovian letters): For
$\{W_k\}$ constructed of Markovian letters,

$$\gamma = \lim_{\alpha \downarrow -1} \Lambda'(\alpha) \in \{0, \log(\phi), \log(2)\},$$

where $\phi = (1 + \sqrt{5})/2$ is the Golden Ratio, and no other
values are possible.

This lemma can be proved by directly evaluating the
derivative of $\Lambda(\alpha)$ with respect to $\alpha$. Note that here $\exp(k\gamma)$
definitely only describes the number of words of equal highest
likelihood when $k$ is large as the initial distribution of the
Markov chain plays no role in $\gamma$'s evaluation.
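The three values in Lemma 6 can be observed numerically from the two-letter sCGF. The sketch below is an illustration under our own assumptions: it uses the closed-form spectral radius of a 2x2 matrix, rescales every entry by the dominant term so that the expression stays finite close to -1, and estimates the one-sided derivative by a finite difference. The function names, the step `h`, and the three transition matrices (our reading of the matrices quoted for Figure 3) are assumptions.

```python
import math

def scgf_markov2(P, alpha):
    """sCGF for a two-state Markov chain with transition matrix P
    (2x2 nested list). Uses the closed-form spectral radius of the
    entrywise power of P, rescaled by the dominant term."""
    (p11, p12), (p21, p22) = P
    dominant = max(p11, p22, math.sqrt(p12 * p21))
    if alpha <= -1.0:
        return math.log(dominant)
    beta = 1.0 / (1.0 + alpha)
    a = (p11 / dominant) ** beta
    d = (p22 / dominant) ** beta
    bc = (p12 * p21 / dominant ** 2) ** beta
    rho_scaled = 0.5 * (a + d + math.sqrt((a - d) ** 2 + 4.0 * bc))
    return math.log(dominant) + (1.0 + alpha) * math.log(rho_scaled)

def gamma_estimate(P, h=1e-6):
    """One-sided finite-difference estimate of the slope just above -1."""
    return (scgf_markov2(P, -1.0 + h) - scgf_markov2(P, -1.0)) / h

golden_ratio = (1.0 + math.sqrt(5.0)) / 2.0
examples = {
    "log(2)": [[0.5, 0.5], [0.5, 0.5]],
    "log(phi)": [[0.6, 0.4], [0.9, 0.1]],
    "log(1)": [[0.85, 0.15], [0.15, 0.85]],
}
for label, P in examples.items():
    print(label, gamma_estimate(P))
print("reference values", math.log(2.0), math.log(golden_ratio), 0.0)
```

The rescaling matters in practice: close to -1 the exponent becomes very large and the raw entries underflow to zero, whereas the ratios to the dominant term tend to 0 or 1 and leave the limiting spectral radius intact.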

The case where $\gamma = \log(2)$ occurs when $p_{1,1} = p_{2,2} = 1/2$.
The most interesting case is when there are approximately
$\phi^k$ approximately equally most likely words. This occurs if
$p_{1,1} = \sqrt{p_{1,2}\, p_{2,1}} > p_{2,2}$. For large $k$, words of near-maximal
probability have the form of a sequence of 1s, where a 2 can
be inserted anywhere so long as there is a 1 between it and
any other 2s. A further sub-exponential number of aberrations
are allowed in any given sequence. For example, with an
equiprobable initial distribution and $k = 4$ there are 8 most
likely words (1111, 1112, 1121, 1211, 1212, 2111, 2121, 2112)
and $\phi^4 \approx 6.85$.
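The combinatorial family described above, sequences of 1s and 2s in which no two 2s are adjacent, is easy to enumerate. The sketch below is a plain counting illustration (the function name is our own): it lists the words for length 4 and shows that the exponential growth rate of the count approaches the logarithm of the Golden Ratio. It counts words only; it does not compute their probabilities under any particular transition matrix.

```python
import itertools
import math

def words_without_adjacent_twos(k):
    """All length-k words over {1, 2} in which no two 2s are adjacent."""
    return [
        "".join(word)
        for word in itertools.product("12", repeat=k)
        if "22" not in "".join(word)
    ]

golden_ratio = (1.0 + math.sqrt(5.0)) / 2.0
print(words_without_adjacent_twos(4))
for k in (4, 8, 12, 16):
    count = len(words_without_adjacent_twos(k))
    print(k, count, math.log(count) / k, math.log(golden_ratio))
```

The counts follow the Fibonacci numbers (8, 55, 377, 2584 for the lengths shown), which is why the Golden Ratio governs the growth rate.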

Figure 3 gives plots of $\Lambda^*(x)$ versus $x$ illustrating the full
range of possible shapes that rate functions can take: linear,
linear then strictly convex, or strictly convex, based on the
transition matrices

$$\begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}, \quad \begin{pmatrix} 0.6 & 0.4 \\ 0.9 & 0.1 \end{pmatrix}, \quad \begin{pmatrix} 0.85 & 0.15 \\ 0.15 & 0.85 \end{pmatrix}$$

respectively.